
There are a few things we need to know: the definition of the metric, how to build intuition about it by performing sanity checks and sensitivity tests, and how to characterize the metric through evaluation.
Invariance metric vs Evaluation metric
Invariance metrics perform a kind of sanity check for us before running the experiment, such as checking whether the distribution is the same across groups. They provide a consistent check across all of our experiments, which is why they shouldn't change between control and experiment; if they do change, we have to start over. An example of an invariant that may legitimately change is the number of users involved, when we deliberately increase it.
Evaluation metrics are usually business metrics: for example, market share, number of users, or user-experience metrics. Sometimes they measure things that, as stated previously, take a long time to observe because a single session doesn't contain enough information, such as how many users got a job after taking a MOOC. These are difficult metrics that require special techniques, as discussed in the next post.
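As a concrete illustration of such a sanity check, here is a minimal sketch (the experiment counts are made up for illustration) that tests whether cookies split 50/50 between control and experiment, using a z-test on the observed control fraction:

```python
import numpy as np

# Hypothetical daily cookie counts for control and experiment.
# Under a 50/50 split, the overall control fraction should be close to 0.5.
control = np.array([87029, 113407, 84843, 104994, 99327, 92052, 60684])
experiment = np.array([86890, 113312, 85004, 104805, 99530, 91770, 60829])

total = control.sum() + experiment.sum()
p_hat = control.sum() / total        # observed control fraction
se = np.sqrt(0.5 * 0.5 / total)      # standard error under the null p = 0.5
z = (p_hat - 0.5) / se               # z-statistic

# If |z| < 1.96 the split passes the 95% sanity check.
print(p_hat, z, abs(z) < 1.96)
```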
from IPython.display import Image
Image(filename='png/1.png')
Image(filename='png/1-2.png')
Image(filename='png/2.png')
Image(filename='png/4.png')
Image(filename='png/5.png')
Image(filename='png/6.png')
Image(filename='png/7.png')
Image(filename='png/8.png')
Image(filename='png/9.png')
Image(filename='png/10.png')
Image(filename='png/11.png') #Getting a Job Long process...
Image(filename='png/12.png')
Image(filename='png/12-2.png')
Image(filename='png/13.png')
What if a user visits at 11:50 and clicks at 12:01?
Def #1 (Cookie probability): For each
Image(filename='png/14.png')
Image(filename='png/15.png')
Image(filename='png/16.png')
Image(filename='png/17.png')
Image(filename='png/18.png')
Image(filename='png/19.png')
Image(filename='png/20.png')
Image(filename='png/21.png')
Image(filename='png/22.png')
Image(filename='png/23.png')
Image(filename='png/24.png')
Image(filename='png/25.png')
Image(filename='png/26.png')
Image(filename='png/27_2.png')
Image(filename='png/27.png')
Image(filename='png/28.png')
Image(filename='png/29.png') #highest : range of resolutions
Image(filename='png/30.png')
Image(filename='png/31.png')
Image(filename='png/32.png')
Image(filename='png/33.png')
import numpy as np
# Daily cookie counts; compute a 95% confidence interval for the daily mean.
arr = np.array([87029, 113407, 84843, 104994, 99327, 92052, 60684])
x = arr.mean()                    # sample mean
sd = arr.std(ddof=1)              # sample standard deviation
se = sd / np.sqrt(arr.shape[0])   # standard error of the mean
me = 1.96 * se                    # 95% margin of error
(x - me, x + me)
Image(filename='png/34.png')
Let’s talk about some common distributions that come up when you look at real user data.
For example, we could measure the rate at which users click on a result on our search page; analogously, we could measure the average stay time on the results page before traveling to a result. In the first case, you'd probably see what we call a Poisson distribution; in the second, the stay times would be exponentially distributed.
Another common distribution of user data is a “power-law,” Zipfian or Pareto distribution. That basically means that the probability of a more extreme value, z, decreases like 1/z (or 1/z^exponent). This distribution also comes up in other rare events such as the frequency of words in a text (the most common word is really really common compared to the next word on the list). These types of heavy-tailed distributions are common in internet data.
Finally, you may have data that is a composition of different distributions - latency often has this characteristic because users on fast internet connection form one group and users on dial-up or cell phone networks form another. Even on mobile phones you may have differences between carriers, or newer cell phones vs. older text-based displays. This forms what is called a mixture distribution that can be hard to detect or characterize well.
The key here is not to necessarily come up with a distribution to match if the answer isn’t clear - that can be helpful - but to choose summary statistics that make the most sense for what you do have. If you have a distribution that is lopsided with a very long tail, choosing the mean probably doesn’t work for you very well - and in the case of something like the Pareto, the mean may be infinite!
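A quick sketch of why the mean can mislead for heavy-tailed data, using Pareto-type samples drawn with NumPy (the shape parameter here is arbitrary, chosen only to make the tail visible):

```python
import numpy as np

rng = np.random.default_rng(42)

# Heavy-tailed "stay times": note that np.random.Generator.pareto
# draws from the Lomax (Pareto II) distribution with the given shape.
heavy = rng.pareto(a=1.5, size=100_000)

# The mean is dragged upward by rare extreme values;
# the median summarizes the typical user far more stably,
# and a high percentile captures the tail explicitly.
print(heavy.mean())              # sensitive to the tail
print(np.median(heavy))          # robust central summary
print(np.percentile(heavy, 90))  # tail-aware percentile metric
```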
Suppose you run an experiment where you measure the number of visits to your homepage, and you measure 5000 visits in the control and 7000 in the experiment. Then the absolute difference is the result of subtracting one from the other, that is, 2000. The relative difference is the absolute difference divided by the control metric, that is, 40%.
Relative differences in probabilities
For probability metrics, people often use percentage points to refer to absolute differences and percentages to refer to relative differences. For example, if your control click-through-probability were 5%, and your experiment click-through-probability were 7%, the absolute difference would be 2 percentage points, and the relative difference would be 40 percent. However, sometimes people will refer to the absolute difference as a 2 percent change, so if someone gives you a percentage, it's important to clarify whether they mean a relative or absolute difference!
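The percentage-point vs. percent distinction above, spelled out in code with the same click-through numbers:

```python
# Absolute vs. relative difference for the click-through example.
p_control = 0.05
p_experiment = 0.07

absolute_diff = p_experiment - p_control   # measured in percentage points
relative_diff = absolute_diff / p_control  # measured relative to control

print(f"{absolute_diff * 100:.0f} percentage points")  # 2 percentage points
print(f"{relative_diff * 100:.0f}% relative change")   # 40% relative change
```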
Directly estimating a confidence interval without making any assumptions about the data
For a lot of analysts, a majority of the time is spent validating and choosing a metric rather than actually running the experiment. Being able to standardize the definitions was critical. When measuring latency, are you talking about the time until the first byte loads or until the last byte loads? Also, for latency, the mean may not change at all. The signals (e.g. slow/fast connections or browsers) cause lumps in the distribution, and no central measure works; one needs to look at the right percentile metric. The key thing is that you are building intuition: you have to understand the data and the business, and work with the engineers to understand how the data is being captured.
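A minimal sketch of that latency point, with a made-up two-group mixture: a regression that only hits the slow group leaves the median essentially untouched, while the 90th percentile jumps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latency mixture (ms): 80% fast connections, 20% slow.
fast = rng.normal(loc=100, scale=10, size=8000)
slow = rng.normal(loc=500, scale=50, size=2000)
latency = np.concatenate([fast, slow])

# A change that makes only the slow group 200 ms worse:
latency_after = np.concatenate([fast, slow + 200])

# The median (a central measure) barely moves; the 90th percentile jumps.
print(np.median(latency), np.median(latency_after))
print(np.percentile(latency, 90), np.percentile(latency_after, 90))
```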
A/A testing differs from A/B testing: where A/B compares A (control) against B (experiment) to detect changes that we care about, A/A compares one control group against another control group to identify changes we don't know about (lurking variables) that add variability to our data.
A/A tests usually require a larger sample size, since the standard error shrinks only with the square root of the sample size (halving the SE requires four times as many samples), and gathering that many samples tends to be expensive. The alternative is to use one big experiment, or to bootstrap: chunk it into smaller parts and treat each part as an experiment. We don't always use the bootstrap, though; if it doesn't agree with our analytical variance, we have to fall back on the big experiment.
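A minimal bootstrap sketch along those lines, using hypothetical daily click-through probabilities from a single control group; the resulting empirical interval should be compared against the analytical SE-based one before being trusted:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical daily click-through probabilities from one control group.
ctp = np.array([0.10, 0.11, 0.09, 0.12, 0.08, 0.10, 0.11, 0.09, 0.10, 0.13])

# Bootstrap: resample the days with replacement many times and
# look at the spread of the resampled means.
boot_means = np.array([
    rng.choice(ctp, size=ctp.size, replace=True).mean()
    for _ in range(10_000)
])

# Empirical 95% interval from the bootstrap distribution.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(lo, hi)
```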
Image(filename='png/35.png')
Image(filename='png/36.png')
Image(filename='png/37.png')
Image(filename='png/38.png')
Image(filename='png/39.png')
a = """0.02
0.11
0.14
0.05
0.09
0.11
0.09
0.1
0.14
0.08
0.09
0.08
0.09
0.08
0.12
0.09
0.16
0.11
0.12
0.11
0.06
0.11
0.13
0.1
0.08
0.14
0.1
0.08
0.12
0.09
0.14
0.1
0.08
0.08
0.07
0.13
0.11
0.08
0.1
0.11"""
b = """0.07
0.11
0.05
0.07
0.1
0.07
0.1
0.1
0.12
0.14
0.04
0.07
0.07
0.06
0.15
0.09
0.12
0.1
0.08
0.09
0.08
0.08
0.14
0.09
0.1
0.08
0.08
0.09
0.08
0.11
0.11
0.1
0.14
0.1
0.08
0.05
0.19
0.11
0.08
0.13"""
a
b
arr_a = np.array(a.split('\n'), dtype='float32')  # control click-through rates
arr_b = np.array(b.split('\n'), dtype='float32')  # experiment click-through rates
arr_a
arr_b
a_sim = np.random.choice(arr_a, size=100)  # resample with replacement (bootstrap-style)
b_sim = np.random.choice(arr_b, size=100)
a_sim
b_sim
diff = arr_a - arr_b    # day-by-day paired differences
avg_diff = diff.mean()
std_diff = diff.std()   # spread of the individual differences
ME = 1.96 * std_diff    # analytical ~95% interval for a single difference
(avg_diff - ME, avg_diff + ME)
# With 40 data points, an empirical ~95% CI comes from dropping the single
# largest and smallest values (2/40 = 5%, split across the two tails).
diff.sort()
diff2 = diff[1:-1]      # drop one extreme value at each end
(diff2.min(), diff2.max())
Image(filename='png/40.png')